Distance OP

Neural Information Processing Systems

Conventional KD methods propose various designs to allow the student model to imitate the teacher better. However, these handcrafted KD designs rely heavily on expert knowledge and may be sub-optimal for various teacher-student pairs.


SLaM: Student-Label Mixing for Distillation with Unlabeled Examples

Neural Information Processing Systems

Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates "soft" pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called "forward loss-adjustment" methods.
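The conventional setup described in this abstract, where a teacher produces "soft" pseudo-labels that a student is trained against, can be sketched as follows. This is only a minimal illustration of plain soft-label distillation on one unlabeled example, not the SLaM mixing procedure itself; the logits, the `temperature` parameter, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to a probability vector (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_cross_entropy(student_logits, teacher_probs):
    """Cross-entropy of the student's prediction against a soft pseudo-label."""
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

# Hypothetical teacher and student outputs for a single unlabeled example.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]

# The teacher's soft pseudo-label is its full predicted distribution.
pseudo_label = softmax(teacher_logits)

# The student is trained to reduce this loss; a noisy pseudo-label
# would pull the student toward the teacher's mistake.
loss = soft_label_cross_entropy(student_logits, pseudo_label)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, which is why noise in the teacher's pseudo-labels transfers directly to the student in this plain formulation.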



Supplementary Materials for the Paper "L2T-DLN: Learning to Teach with Dynamic Loss Network"

Neural Information Processing Systems

In this supplementary material, we provide the proofs of the convergence analysis in Section 1, the 1-vs-1 transformation employed in the classification and semantic segmentation tasks in Section 2, the coordinate-wise and the preprocessing method of the LSTM teacher in Section 3, the loss functions of YOLO-v3 in Section 4, more image classification experiments in Section 5, and the inferences of semantic segmentation in Section 6.

Definition 1. A differentiable function $e(\cdot)$ is $L$-smooth with gradient Lipschitz constant $C$ (uniformly Lipschitz continuous) if $\|\nabla e(x) - \nabla e(y)\| \le C \|x - y\|$, $\forall x, y$. The function is called block-wise smooth with gradient Lipschitz constants $C_i$ if

$$\|\nabla_i e(x_{-i}, x_i) - \nabla_i e(x_{-i}, x_i')\| \le C_i \|x_i - x_i'\|, \quad \forall x, x' \tag{1}$$

or with gradient Lipschitz constants $\{\tilde{C}_i\}$ if

$$\|\nabla_i e(x_{-i}, x_i) - \nabla_i e(x_{-i}', x_i)\| \le \tilde{C}_i \|x_{-i} - x_{-i}'\|, \quad \forall x, x' \tag{2}$$

Further, let $G_{\max} := \max\{C_i, \tilde{C}_i, \forall i\} \le C$.

Definition 2. For a differentiable function $e(\cdot)$, if $\nabla e(x) = 0$, then $x$ is a first-order stationary solution (SS1). For a differentiable function $e(\cdot)$, if $x$ is an SS1 and there exists $\epsilon > 0$ such that for any $y$ in the $\epsilon$-neighborhood of $x$ we have $e(x) \le e(y)$, then $x$ is a local minimum. A saddle point $x$ is an SS1 that is not a local minimum. If $\lambda_{\min}(\nabla^2 e(x)) < 0$, $x$ is a strict (non-degenerate) saddle point.
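The stationary-point terminology in Definition 2 can be illustrated with a small check on two-variable functions. The sketch below assumes the gradient and the 2x2 Hessian at the candidate point are already known; the function names and the tolerance `tol` are hypothetical and not part of the supplementary material.

```python
import math

def min_eigenvalue_2x2(a, b, d):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius

def classify_stationary_point(grad, hessian, tol=1e-9):
    """Classify a point from its gradient and 2x2 Hessian.

    A point is SS1 when the gradient vanishes. A negative smallest Hessian
    eigenvalue marks a strict saddle; a positive one implies a local minimum.
    """
    if math.hypot(*grad) > tol:
        return "not stationary"
    (a, b), (_, d) = hessian
    lam_min = min_eigenvalue_2x2(a, b, d)
    if lam_min < -tol:
        return "strict saddle"
    if lam_min > tol:
        return "local minimum"
    return "degenerate (second-order test inconclusive)"

# e(x, y) = x^2 - y^2 at the origin: gradient (0, 0), Hessian diag(2, -2).
saddle = classify_stationary_point((0.0, 0.0), ((2.0, 0.0), (0.0, -2.0)))

# e(x, y) = x^2 + y^2 at the origin: gradient (0, 0), Hessian diag(2, 2).
minimum = classify_stationary_point((0.0, 0.0), ((2.0, 0.0), (0.0, 2.0)))
```

A zero smallest eigenvalue is reported as inconclusive because the second-order condition alone cannot separate a degenerate saddle from a local minimum.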


L2T-DLN: Learning to Teach with Dynamic Loss Network

Neural Information Processing Systems

With the concept of teaching being introduced to the machine learning community, teacher models have started using dynamic loss functions to guide the training of a student model. The dynamic loss is intended to assign adaptive loss functions to different phases of student model learning. In existing works, the teacher model 1) merely determines the loss function based on the present states of the student model, i.e., disregards the experience of the teacher; 2) only utilizes the states of the student model, e.g., training iteration number and loss/accuracy from training/validation sets, while ignoring the states of the loss function. In this paper, we first formulate the loss adjustment as a temporal task by designing a teacher model with memory units, which enables the student learning to be guided by the experience of the teacher model. Then, with a dynamic loss network, we can additionally use the states of the loss to assist the teacher learning in enhancing the interactions between the teacher and the student model. Extensive experiments demonstrate that our approach can enhance student learning and improve the performance of various deep models on real-world tasks, including classification, object detection, and semantic segmentation scenarios.




description of our method

Neural Information Processing Systems

Algorithm 2 Procedure for estimating the weights
1: procedure ESTIMATEWEIGHTS(Teacher, Student, V, D)
2:   ▷ V is the validation dataset and D is the teacher-labeled dataset
3:   U, k d12 p |V|e
4:   for every (x, y) ∈ V do
5:     X ← (Confidence(Teacher(x)), Confidence(Student(x)))
6:     if arg max(Teacher(x)) = arg max(y) then:
7:       (p, distortion) ← (0, 1)
8:     else:

B.1 The student's test-accuracy-trajectory

In this section we provide extended experimental results that show the student's test accuracy over the training trajectory for the experiments mentioned in Section 3.1. Notice that in the vast majority of cases our method significantly outperforms the conventional approach almost throughout the training process.

The student's test accuracy over the training trajectory using hard-distillation corresponding to the experiments of Figure 4. See Section 3.1.2

The student's test accuracy over the training trajectory corresponding to the experiments of Figure 5. See Section 3.1.2

The student's test accuracy over the training trajectory corresponding to the experiments of Figure 7. See Section 3.1.3

The student's test accuracy over the training trajectory using hard-distillation (first row) and soft-distillation (second row) corresponding to the experiments of Figure 8. See Section 3.1.4

Indeed, it is known (see e.g.